Bootstrapping Acoustic-Trawl Uncertainty Estimates

This notebook outlines how a variety of bootstrapping procedures, both parametric and non-parametric, can be combined to estimate the total uncertainty in an acoustic-trawl survey. It follows the basic steps used in our usual MACE survey analysis, which makes it easy to interpret in relation to our standard outputs. It also runs quickly, and unlike more advanced state-space models, does not need to maximize anything globally or converge to a posterior. Its disadvantage compared to those models is that it estimates the distribution of the outputs of a deterministic processing pipeline, which eliminates a few of the degrees of freedom a state-space model would have, and means the distributions of the results are a step removed from the underlying biology we're really trying to estimate.

Loading in data from the 2018 EBS survey and plotting the NASC and trawl locations:

One of our fundamental challenges is that we need to infer the density of fish across a continuous large marine ecosystem (e.g. the EBS), but can only sample that ecosystem along discrete line transects separated by tens of kilometers, with a few point trawls along each transect. We work around this extremely sparse sampling by assuming that what we see along the transects--both the mean density of fish and their spatial pattern--tells us about the range of possible densities that could exist between them.

For spatial variables that are more or less normally distributed, traditional geostatistical techniques like variography, Gaussian processes, and conditional simulation offer a standard way to do this. For non-Gaussian variables like acoustic backscatter, however, these techniques will not work out of the box. One workaround is to transform the raw variable (NASC, in our case) to make it approximately normal, then do Gaussian modeling and simulation on the transformed variable. However, this approach can get complicated quickly, since transforming the data also changes its variogram, and when transformed back onto its original scale, the spatial autocorrelation may not be correct. This "Gaussian anamorphosis" and associated problems formed a major part of Woillez et al.'s (2016) uncertainty analysis.

Conventional geostatistics (variography, Kriging, etc.) don't explicitly require normality--they just put things in terms of means and covariances (which makes them easy to apply if we assume our spatial variable is multivariate normal).

$$ \mu = \mathbb{E}(\mathbf{x})\\ Q = \mathrm{cov}(\mathbf{x}) = \mathbb{E}\left[(\mathbf{x} - \mu)(\mathbf{x} - \mu)^T\right] $$

If $\mathbf{z}$ is a vector of zero-mean, uncorrelated random variables with variance 1, then $\mathrm{cov}(\mathbf{z}) = \mathbb{E}(\mathbf{z} \mathbf{z}^T) = I$, where $I$ is the identity matrix (because the elements of $\mathbf{z}$ are all uncorrelated with each other, all off-diagonal elements of its covariance matrix are zero).

If we want a vector of random numbers correlated according to an arbitrary covariance matrix $Q$, we can use the well-known "Cholesky trick." The Cholesky decomposition factorizes a (positive-definite) matrix into the product of a lower-triangular matrix and its transpose, $Q = L L^T$.

$$ Q = L L^T = L I L^T = L \mathbb{E}(\mathbf{z} \mathbf{z}^T) L^T = \mathbb{E}(L \mathbf{z} \mathbf{z}^T L^T) = \mathbb{E}((L \mathbf{z}) (\mathbf{z}^T L^T)) = \mathbb{E}((L \mathbf{z}) (L \mathbf{z})^T), $$

which, from the definition of covariance, implies that

$$ \mathrm{cov}(L \mathbf{z}) = Q. $$

This means that the random vector $\mathbf{x} = L \mathbf{z}$ generated by multiplying $L$ with the uncorrelated random vector $\mathbf{z}$ will be correlated according to the covariance matrix $Q$. And since a variogram fitted to data gives us the theoretical correlation between any two spatial points, this allows us to calculate $Q$ and $L$, and to generate random fields with that spatial structure.
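As a minimal sketch of the Cholesky trick, here is a toy version on a hypothetical 1-D grid, using an exponential covariance model with made-up sill and range values standing in for whatever model gets fitted to the real variogram:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 1-D grid of simulation points (distances in km)
x = np.linspace(0, 100, 50)

# Exponential covariance model; the sill and range here are made-up values
sill, corr_range = 2.0, 15.0
dists = np.abs(x[:, None] - x[None, :])
Q = sill * np.exp(-dists / corr_range)
Q += 1e-8 * np.eye(len(x))  # tiny jitter for numerical positive-definiteness

L = np.linalg.cholesky(Q)        # Q = L @ L.T
z = rng.standard_normal(len(x))  # uncorrelated, zero-mean, unit-variance noise
field = L @ z                    # correlated field with covariance Q
```

The same construction works for 2-D grids; only the distance matrix changes.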

Another condition must be met as well: we want the simulated field to have the correct mean value at every point, interpolated between observations. This is easy if the field is Gaussian, because the mean can be subtracted off, then added back after simulating a mean-zero correlated spatial field. With a non-negative field, this won't work. Instead, we need to set the mean of each element of $\mathbf{z}$ such that $\mathbf{x}$ has the desired expected value $\mathbf{\mu}_x$ (shifting the means of $\mathbf{z}$ leaves its covariance, and therefore the covariance of $\mathbf{x}$, unchanged).

$$ \mathbf{\mu}_x = \mathbb{E}(L \mathbf{z}) = L \mathbb{E}(\mathbf{z}) \\ L^{-1} \mathbf{\mu}_x = \mathbb{E}(\mathbf{z}) \equiv \mathbf{\mu}_z $$

Our first step is to estimate the empirical variogram and fit a model to it. This model will then allow us to calculate $Q$ and $L$.
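A bare-bones version of the classical (Matheron) empirical variogram estimator might look like the sketch below; binning choices and any anisotropy handling are omitted, and the coordinates and values are made up just to exercise the function:

```python
import numpy as np

def empirical_variogram(coords, values, bin_edges):
    """Classical (Matheron) semivariance estimator, binned by lag distance."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    gamma = 0.5 * (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)  # count each point pair once
    lags, semivars = d[iu], gamma[iu]
    centers, semivar = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (lags >= lo) & (lags < hi)
        if in_bin.any():
            centers.append(lags[in_bin].mean())
            semivar.append(semivars[in_bin].mean())
    return np.array(centers), np.array(semivar)

# Made-up coordinates and values, just to exercise the function
rng = np.random.default_rng(2)
coords = rng.uniform(0, 50, size=(40, 2))
values = rng.normal(size=40)
centers, semivar = empirical_variogram(coords, values, np.linspace(0, 70, 8))
```

A parametric model (exponential, spherical, etc.) would then be fitted to `(centers, semivar)` by least squares.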

Next we have to define a spatial domain to do the simulation on. I find the convex hull of all the transects and define a 5 km x 5 km grid inside that polygon.
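One way to sketch this step with SciPy, using made-up transect coordinates: build the convex hull, lay down a regular grid over the bounding box, and keep the grid points that fall inside the hull (`Delaunay.find_simplex` returns -1 for points outside):

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(1)
# Made-up transect point coordinates (eastings/northings, km)
transect_pts = rng.uniform(0, 100, size=(200, 2))

hull = ConvexHull(transect_pts)
hull_pts = transect_pts[hull.vertices]

# Candidate 5 km x 5 km grid over the bounding box
xx, yy = np.meshgrid(np.arange(0, 100, 5.0), np.arange(0, 100, 5.0))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Keep grid points inside the hull; find_simplex returns -1 outside
tri = Delaunay(hull_pts)
sim_grid = grid[tri.find_simplex(grid) >= 0]
```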

Setting up the geostatistical simulation problem, defining the matrix $L$ based on the variogram:

Unlike the normal distribution, non-negative distributions are usually not parameterized explicitly in terms of their mean and variance. For instance, the gamma distribution has a "shape" parameter $k$ and a "scale" parameter $\theta$. However, since these determine its mean and variance as $\mu = k \theta$ and $\sigma^2 = k \theta^2$, it's easy to solve for them given a desired $\mu$ and $\sigma^2$: $k = \mu^2 / \sigma^2$ and $\theta = \sigma^2 / \mu$.
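This moment matching is a two-liner; a sketch using SciPy's gamma distribution to check the algebra:

```python
import numpy as np
from scipy import stats

def gamma_from_moments(mu, sigma2):
    """Solve mu = k*theta and sigma2 = k*theta**2 for the gamma shape k
    and scale theta."""
    return mu**2 / sigma2, sigma2 / mu

k, theta = gamma_from_moments(10.0, 4.0)
dist = stats.gamma(a=k, scale=theta)
# dist.mean() -> 10.0, dist.var() -> 4.0
```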

Here, we get $L$ and the interpolated mean $\mu_x$ out of the pre-processed parameter structure from the cell above. We force $\mu_x$ to have the same mean as the observed NASC, make sure it's all non-negative, and then solve the equation $$ \mu_z = L^{-1} \mu_x $$ (as a triangular solve, rather than by explicitly inverting $L$) to get the white-noise means. The white-noise variances are all set to 1.
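Because $L$ is lower-triangular, the solve is just a forward substitution. A sketch with stand-in values (the real $L$ and $\mu_x$ come from the parameter structure, not from random numbers as here):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Stand-in values for the fitted Cholesky factor and interpolated mean field
rng = np.random.default_rng(0)
n = 30
A = rng.normal(size=(n, n))
L = np.linalg.cholesky(A @ A.T + n * np.eye(n))  # a valid lower-triangular factor
mu_x = np.abs(rng.normal(loc=5.0, size=n))       # forced non-negative

# mu_z = L^{-1} mu_x via forward substitution, not an explicit inverse
mu_z = solve_triangular(L, mu_x, lower=True)
```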

Next, we define a couple of helper functions and do the actual simulation. I'm using an inverse Gaussian distribution for the white noise, but there are other non-negative distributions I could try as well to see which gives the most realistic output.
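For reference, one way to parameterize SciPy's inverse Gaussian so it has a specified mean $m > 0$ and unit variance (this assumes SciPy's convention, where `invgauss(mu, scale=s)` has mean `mu*s` and variance `mu**3 * s**2`):

```python
import numpy as np
from scipy import stats

def invgauss_unit_var(m):
    """Inverse Gaussian with mean m (> 0) and variance 1.

    scipy.stats.invgauss(mu, scale=s) has mean mu*s and variance
    mu**3 * s**2, so scale = m**3 and mu = 1/m**2 give mean m, variance 1.
    """
    return stats.invgauss(mu=1.0 / m**2, scale=m**3)

d = invgauss_unit_var(2.0)
# d.mean() -> 2.0, d.var() -> 1.0
```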

Comparing this map of simulated NASC to the bubble plot at the top of the document, we can see that it does a decent job of interpolating the acoustic density between transects, while also allowing that density to fluctuate randomly.

I simulate 100 of these fields, and look at the distribution of their means, standard deviations, and 10th and 90th percentiles compared to the same statistics for the observed NASC (orange lines). The simulated fields look decently realistic:

Next, we need to load in the trawl data, georeference it, and calculate some trawl-by-trawl summaries, like the average backscattering cross-section, length, and weight. We also fit variogram models to these averaged quantities, which allow us to simulate autocorrelated values for them across the same simulation pixels we did for NASC.

So at every simulated point, we now have a NASC value and a $\langle \sigma_{bs} \rangle$ value, which is enough to get us an estimate of total numerical density. We next need to figure out how to allocate that density to age classes based on the proportions in the trawl.
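The conversion itself is the standard fisheries-acoustics relationship $\rho = \mathrm{NASC} / (4 \pi \langle \sigma_{bs} \rangle)$; a sketch with made-up values:

```python
import numpy as np

def numerical_density(nasc, sigma_bs):
    """Areal number density (fish per square nautical mile) from NASC
    (m^2 nmi^-2) and mean backscattering cross-section sigma_bs (m^2)."""
    return nasc / (4 * np.pi * sigma_bs)

# e.g. a NASC of 1000 m^2 nmi^-2 and a mean TS of -35 dB (made-up numbers)
sigma_bs = 10 ** (-35 / 10)  # TS (dB re 1 m^2) -> sigma_bs (m^2)
rho = numerical_density(1000.0, sigma_bs)
```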

I take a simple approach of resampling the measured fish from the trawl, i.e. from the "scaling key source data" table in Macebase, and then calculating proportions-at-age from those bootstrapped tables. The cell below defines a few helper functions to do that.
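The core of that resampling step can be sketched like this, with a made-up aged catch standing in for the Macebase table (the real helpers also handle lengths and weights):

```python
import numpy as np

def bootstrap_age_proportions(ages, rng):
    """Resample one trawl's aged fish with replacement and recompute the
    proportions-at-age from the resampled table."""
    resampled = rng.choice(ages, size=len(ages), replace=True)
    age_classes, counts = np.unique(resampled, return_counts=True)
    return dict(zip(age_classes, counts / counts.sum()))

# Made-up aged catch from a single trawl
rng = np.random.default_rng(7)
ages = np.array([3, 4, 4, 4, 5, 5, 6, 7, 4, 5])
props = bootstrap_age_proportions(ages, rng)
```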

The only challenging problem left is how to apply data from the trawls to the (simulated) acoustic backscatter values in between them. In the standard MACE procedure, we just assign each acoustic interval/cell to the nearest trawl, as illustrated here:

These plots show the distance of every simulated pixel from the nearest trawl, as well as a color-coded map showing the actual hard assignments.

We'd like to do this in a probabilistic way, though. One approach is to assign each pixel to a random trawl, and then repeat the random assignment a number of times. A simple random shuffling won't be appropriate; we'd like pixels to be assigned to nearby trawls more often than far-away trawls. We can accomplish this by weighting the probability of assignment to each trawl by the inverse distance to that trawl, with an exponent $a$ used to adjust the effective range of this weighting. I set $a$ such that an average pixel had about a 50% chance of being assigned to the nearest trawl at a range of about 20 km from that trawl. This was about half the typical distance between trawls, and also about the same as the variogram range for mean TS. While this method of setting $a$ and defining the range is maybe a bit ad-hoc, I haven't come up with anything better yet.
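A sketch of the inverse-distance-weighted random assignment, with made-up pixel and trawl coordinates (the exponent `a` here is arbitrary, not the tuned value described above):

```python
import numpy as np

def random_trawl_assignment(pixels, trawls, a, rng):
    """Assign each pixel to a random trawl, with probability proportional
    to inverse distance raised to the exponent a."""
    d = np.linalg.norm(pixels[:, None, :] - trawls[None, :, :], axis=-1)
    w = 1.0 / np.maximum(d, 1e-6) ** a  # guard against zero distances
    p = w / w.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(trawls), p=pi) for pi in p])

rng = np.random.default_rng(3)
pixels = rng.uniform(0, 100, size=(50, 2))  # made-up simulation grid
trawls = rng.uniform(0, 100, size=(5, 2))   # made-up trawl locations
assignments = random_trawl_assignment(pixels, trawls, a=2.0, rng=rng)
```

Larger `a` concentrates probability on the nearest trawl; `a = 0` makes every trawl equally likely.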

Defining some helper functions and then doing the random pixel assignment:

In the plot above, we can see the result of one of these random-assignment draws. There are still zones of influence around the trawls, visible as the blobs of uniform color, but the edges, where pixels are roughly equidistant from two or more trawls, are blurred by the random reassignment.

An alternative way of randomizing this part of the processing pipeline is to randomly jitter the trawl locations and reassign the pixels to the nearest jittered trawl:
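A minimal sketch of this jitter-and-reassign step, assuming an isotropic Gaussian jitter (the exact jitter distribution is an assumption here, and the coordinates are made up):

```python
import numpy as np

def jittered_nearest_trawl(pixels, trawls, sd_km, rng):
    """Perturb each trawl location with isotropic Gaussian noise, then
    assign each pixel to the nearest jittered trawl."""
    jittered = trawls + rng.normal(scale=sd_km, size=trawls.shape)
    d = np.linalg.norm(pixels[:, None, :] - jittered[None, :, :], axis=-1)
    return d.argmin(axis=1)

rng = np.random.default_rng(5)
pixels = rng.uniform(0, 100, size=(50, 2))
trawls = rng.uniform(0, 100, size=(5, 2))
idx = jittered_nearest_trawl(pixels, trawls, sd_km=20.0, rng=rng)
```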

The black points in each plot show the true trawl locations, but the region boundaries on the right reflect the nearest-neighbor assignment based on their locations plus a 20 km random jitter.

Before putting everything together, I define distributions for two other sources of error: the calibration uncertainty, which I set to 0.1 dB (debatable) and the proportion of water column NASC falling in the acoustic dead zone (also debatable).
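As a sketch of how the calibration term might enter the loop, assuming the 0.1 dB uncertainty is drawn as zero-mean Gaussian in the dB domain and applied multiplicatively in the linear domain:

```python
import numpy as np

rng = np.random.default_rng(11)

# Draw a calibration error in the dB domain (sd = 0.1 dB, per the text)
# and convert it to a multiplicative factor on NASC in the linear domain
cal_err_db = rng.normal(0.0, 0.1)
cal_factor = 10 ** (cal_err_db / 10)

nasc = 1000.0  # made-up NASC value
nasc_calibrated = nasc * cal_factor
```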

Finally, putting all these steps together in a single loop of 100 bootstrap simulations. At every iteration, we simulate a NASC field conditional on the data, simulate a mean $\sigma_{bs}$ field conditional on the data, use those two variables to calculate a map of total numerical density, bootstrap all the trawl catches, and then divide up the total number of fish according to the bootstrapped proportions-at-age.

Plotting those distributions of abundance at age:

Looking at these results another way, we can calculate the C.V. of abundance at each age:

These are in the range of 4% for the most abundant age groups (4-6 year olds), and range up to 22% for the 10+ fish.

The total variance seems to be dominated by the variance of the most abundant age classes, and is also in the 4% range.